Easy Monotonic Policy Iteration

Author

  • Joshua Achiam
Abstract

A key problem in reinforcement learning for control with general function approximators (such as deep neural networks and other nonlinear functions) is that, for many algorithms employed in practice, updates to the policy or Q-function may fail to improve performance—or worse, actually cause the policy performance to degrade. Prior work has addressed this for policy iteration by deriving tight policy improvement bounds; by optimizing the lower bound on policy improvement, a better policy is guaranteed. However, existing approaches suffer from bounds that are hard to optimize in practice because they include sup norm terms which cannot be efficiently estimated or differentiated. In this work, we derive a better policy improvement bound where the sup norm of the policy divergence has been replaced with an average divergence; this leads to an algorithm, Easy Monotonic Policy Iteration, that generates sequences of policies with guaranteed non-decreasing returns and is easy to implement in a sample-based framework.
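
To make the abstract's claim concrete, the kind of lower bound at issue can be sketched as follows. This is an illustrative, TRPO-style form under assumed notation (the discounted state distribution d^π, the advantage function A^π, and the constant C are expository assumptions, not quoted from the paper):

  J(\pi') - J(\pi) \;\ge\; \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi'}\!\left[ A^{\pi}(s,a) \right] \;-\; C\, \mathbb{E}_{s \sim d^{\pi}}\!\left[ D_{\mathrm{TV}}\big( \pi'(\cdot \mid s) \,\|\, \pi(\cdot \mid s) \big) \right]

Earlier bounds of this type penalize the worst-case divergence, \max_s D_{\mathrm{TV}}(\pi'(\cdot \mid s) \,\|\, \pi(\cdot \mid s)), which cannot be estimated or differentiated from samples. Replacing that sup-norm term with an expectation over d^π gives a lower bound in which every term can be estimated from trajectories collected under the current policy π, which is what makes the monotonic-improvement guarantee practical to enforce.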

Similar resources

Point-Based Policy Iteration

We describe a point-based policy iteration (PBPI) algorithm for infinite-horizon POMDPs. PBPI replaces the exact policy improvement step of Hansen’s policy iteration with point-based value iteration (PBVI). Despite being an approximate algorithm, PBPI is monotonic: At each iteration before convergence, PBPI produces a policy for which the values increase for at least one of a finite set of init...

Exploiting policy knowledge in online least-squares policy iteration: An empirical study

Reinforcement learning (RL) is a promising paradigm for learning optimal control. Traditional RL works for discrete variables only, so to deal with the continuous variables appearing in control problems, approximate representations of the solution are necessary. The field of approximate RL has tremendously expanded over the last decade, and a wide array of effective algorithms is now available....

Safe Policy Iteration

CONTRIBUTIONS 1. Theoretical contribution. We introduce a new, more general lower bound to the policy improvement of an arbitrary policy compared to another policy based on the ability to bound the distance between the future state distributions. 2. Algorithmic contribution. We define two approximate policy–iteration algorithms whose policy improvement moves toward the estimated greedy policy b...

APPENDIX A: Notation and Mathematical Conventions

Index entries (truncated): …ion, 31; Affine monotonic model, 19, 162, 189, 195; Aggregation, 20; Aggregation, distributed, 23; Aggregation, multistep, 27; Aggregation equation, 25; Aggregation probability, 21; Approximate DP, 24; Approximation models, 24, 49; Asynchronous algorithms, 30, 71, 90, 186, 196, 235; Asynchronous convergence theorem, 74, 92; Asynchronous policy iteration, 23, 77, …

Scheduling policy design for autonomic systems

Scheduling the execution of multiple concurrent tasks on shared resources such as CPUs and network links is essential to ensuring the reliable operation of many autonomic systems. Well-known techniques such as rate-monotonic scheduling can offer rigorous timing and preemption guarantees, but only under assumptions (i.e. a fixed set of tasks with well-known execution times and invocation rates) ...

Journal title:
  • CoRR

Volume: abs/1602.09118

Publication date: 2016